Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Issue 9656 Raise an error when check_cols has duplicated values #9698

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

ariosramirez
Copy link

resolves #9656

Problem

Currently, the check strategy for Snapshots does not throw an error if there are duplicate column names specified in the check_cols config, and accidentally specifying a column twice results in a record being inserted even if none of the column values changed.

Problem described in #9656 (comment)

Solution

This PR adds validation to detect duplicate column names in the check_cols configuration of a snapshot. It raises an error if duplicates are found.

Checklist

  • I have read the contributing guide and understand what's expected of me
  • I have run this code in development and it appears to resolve the stated issue
  • This PR includes tests, or tests are not required/relevant for this PR
  • This PR has no interface changes (e.g. macros, cli, logs, json artifacts, config files, adapter interface, etc) or this PR has already received feedback and approval from Product or DX
  • This PR includes type annotations for new and modified functions

@ariosramirez ariosramirez requested a review from a team as a code owner February 28, 2024 17:04
@cla-bot cla-bot bot added the cla:yes label Feb 28, 2024
Copy link
Contributor

Thank you for your pull request! We could not find a changelog entry for this change. For details on how to document a change, see the contributing guide.

1 similar comment
Copy link
Contributor

Thank you for your pull request! We could not find a changelog entry for this change. For details on how to document a change, see the contributing guide.

@dbeatty10 dbeatty10 added the community This PR is from a community member label Mar 22, 2024
@graciegoheen graciegoheen added the ready_for_review Externally contributed PR has functional approval, ready for code review from Core engineering label May 28, 2024
@@ -52,6 +52,10 @@ def validate(cls, data):
if data.get("materialized") and data.get("materialized") != "snapshot":
raise ValidationError("A snapshot must have a materialized value of 'snapshot'")

# Validate if there are duplicate column names in check_cols.
if len(data.get("check_cols")) != len(set(data.get("check_cols"))):
Copy link

@jordivandooren jordivandooren Sep 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even though .get() is used, if check_cols is missing, this will result in an error because NoneType has no len().

Suggested change
if len(data.get("check_cols")) != len(set(data.get("check_cols"))):
if "check_cols" in data and len(data["check_cols"]) != len(set(data["check_cols"])):

But, judging by the code above it, this check should only apply in case of the check strategy, where the existence of this check_cols has been validated beforehand. I suggest moving this new validation to after line 40.

I expect the current implementation to break the breaks the timestamp strategy, because of the absence of check_cols.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jordivandooren Could you please review the latest changes? I updated my code, and everything is now up-to-date. The duplicated columns validation has been moved inside the final_validate function.

Copy link

cla-bot bot commented Sep 18, 2024

Thank you for your pull request and welcome to our community. We could not parse the GitHub identity of the following contributors: ariosramirez.
This is most likely caused by a git client misconfiguration; please make sure to:

  1. check if your git client is configured with an email to sign commits git config --list | grep email
  2. If not, set it up using git config --global user.email [email protected]
  3. Make sure that the git commit email is configured in your GitHub account settings, see https://github.com/settings/emails

@cla-bot cla-bot bot removed the cla:yes label Sep 18, 2024
Copy link

cla-bot bot commented Sep 19, 2024

Thank you for your pull request and welcome to our community. We could not parse the GitHub identity of the following contributors: ariosramirez.
This is most likely caused by a git client misconfiguration; please make sure to:

  1. check if your git client is configured with an email to sign commits git config --list | grep email
  2. If not, set it up using git config --global user.email [email protected]
  3. Make sure that the git commit email is configured in your GitHub account settings, see https://github.com/settings/emails

2 similar comments
Copy link

cla-bot bot commented Sep 19, 2024

Thank you for your pull request and welcome to our community. We could not parse the GitHub identity of the following contributors: ariosramirez.
This is most likely caused by a git client misconfiguration; please make sure to:

  1. check if your git client is configured with an email to sign commits git config --list | grep email
  2. If not, set it up using git config --global user.email [email protected]
  3. Make sure that the git commit email is configured in your GitHub account settings, see https://github.com/settings/emails

Copy link

cla-bot bot commented Sep 19, 2024

Thank you for your pull request and welcome to our community. We could not parse the GitHub identity of the following contributors: ariosramirez.
This is most likely caused by a git client misconfiguration; please make sure to:

  1. check if your git client is configured with an email to sign commits git config --list | grep email
  2. If not, set it up using git config --global user.email [email protected]
  3. Make sure that the git commit email is configured in your GitHub account settings, see https://github.com/settings/emails

Copy link

codecov bot commented Sep 19, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 88.99%. Comparing base (c6b8f7e) to head (f6a7335).
Report is 9 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #9698      +/-   ##
==========================================
+ Coverage   88.90%   88.99%   +0.09%     
==========================================
  Files         180      181       +1     
  Lines       22856    22974     +118     
==========================================
+ Hits        20319    20445     +126     
+ Misses       2537     2529       -8     
Flag Coverage Δ
integration 86.16% <66.66%> (+0.06%) ⬆️
unit 62.39% <100.00%> (-0.03%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Components Coverage Δ
Unit Tests 62.39% <100.00%> (-0.03%) ⬇️
Integration Tests 86.16% <66.66%> (+0.06%) ⬆️

@dataders
Copy link
Contributor

  • check if your git client is configured with an email to sign commits git config --list | grep email

  • If not, set it up using git config --global user.email [email protected]

  • Make sure that the git commit email is configured in your GitHub account settings, see https://github.com/settings/emails

@ariosramirez did you do these steps already as well?

@QMalcolm
Copy link
Contributor

Hi @ariosramirez, I think I understand what the problem is.

I checked our CLA database, and it does look like you've signed it. What I believe is happening is that cla-bot checks every commit in the PR. A majority of your commits are signed with the email [email protected], and that email address seems correct. However, four commits (8082fa7, 5343684, 67d5cb4, and 7bd6cb5) are signed with the email [email protected]. It appears that email address is no longer associated with your github account. If you still have access to that email address, I recommend re-adding it to your list of associated email address in github (you can have multiple, I myself have five email address associated with my github account). If you no longer have access to that email address, you may need to modify the signature on those commits.

@QMalcolm
Copy link
Contributor

@ariosramirez Alternatively, you can squash the problematic (or all the commits) into one new commit. This should give all the work your current email, I believe

@ariosramirez ariosramirez force-pushed the ADAP-9656_raise_an_error_when_check_cols_has_duplicated_values branch from f6a7335 to b859dea Compare September 20, 2024 04:47
@cla-bot cla-bot bot added the cla:yes label Sep 20, 2024
@ariosramirez
Copy link
Author

QMalcolm

Thanks, @QMalcolm the squash commit worked well.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cla:yes community This PR is from a community member ready_for_review Externally contributed PR has functional approval, ready for code review from Core engineering
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Feature] Warn if duplicate columns are found in check Snapshot strategy
6 participants